COST Action IC1207 PARSEME meeting
Abstract
Dealing with idioms in Natural Language Processing systems is difficult, among other reasons, because the system architecture must be conceived in such a way that it does not preclude the processing of both free word combinations and these more constrained expressions. On the other hand, many idioms do have syntactic structure and can undergo several types of formal variation, which makes them hard to identify with a strict string pattern-matching approach. Furthermore, many of these expressions are ambiguous between a literal (non-idiomatic) and a figurative, non-compositional (idiomatic) use, depending on many linguistic and extra-linguistic factors. This paper presents the way (European) Portuguese verbal idioms have been fully integrated in STRING, a hybrid, statistical and rule-based, natural language processing system, and identifies several of the problems that had (and some that still have) to be addressed in order to adequately identify and process idioms in texts.

1. This paper focuses on verbal idioms, e.g. perder a cabeça, lit: ‘lose the head’ (lose one’s head), that is, idiomatic (semantically non-compositional) expressions consisting of a verb and at least one constrained argument slot, whose overall meaning cannot be calculated from the meaning that the individual elements of the expression would have when used independently, in other contexts (M. Gross 1982, 1996). Extensive lists of verbal idioms, particularly the most frequent ones, have been systematically collected for Portuguese, both for the European (Baptista et al. 2004, 2005) and the Brazilian (Vale 2001) varieties, along with their main distributional, syntactic and transformational properties, under the Lexicon-Grammar methodological and theoretical framework (M. Gross 1996).
Previous studies have shown that the identification of idioms can rely neither on strict pattern-matching techniques (Fernandes and Baptista 2007, 2008) nor on association measures alone, which do not suffice to identify many idioms (Baptista et al. 2010); hence, much manual development of language resources by linguists is required. In this paper, we address the main issues raised in the process of integrating the lexicon-grammar of European Portuguese verbal idioms into a fully-fledged natural language processing system, STRING (Mamede et al. 2012). In order to do so, we briefly present the system in the next section.

2. STRING (string.l2f.inesc-id.pt) is a hybrid statistical and rule-based natural language processing chain for Portuguese, with a modular structure, that performs all the basic NLP tasks in four main steps: (i) preprocessing and lexical analysis, (ii) rule-based and (iii) statistical part-of-speech (POS) disambiguation, and (iv) parsing. The parsing step is performed by the Xerox Incremental Parser (XIP; Aït-Mokhtar et al. 2002), using a rule-based Portuguese grammar jointly developed by INESC-ID Lisboa and Xerox. XIP first delimits the elementary phrases (or chunks, like NP, PP, etc.), and then extracts the dependencies between the chunks’ heads, e.g. SUBJ (subject), MOD (modifier), CDIR (direct complement), etc.

3. Considering that idioms have a syntactic structure, STRING’s strategy consists in first parsing them as ordinary sentences and only then identifying the word combinations whose meaning is not to be calculated in a compositional way, based on the results of the previous parsing. The idioms are identified by the dependency FIXED, which takes as its arguments the verb and the frozen elements of the idiomatic expression (the number of arguments depends...
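The "parse first, then identify" strategy described in section 3 can be illustrated with a small sketch. This is not XIP rule syntax and not STRING's actual implementation: the `Dep` class, the `IDIOMS` lexicon and the `find_fixed` function are hypothetical names introduced here, and the dependency triples are written by hand rather than produced by a parser. The sketch only shows why matching over dependencies (here CDIR between the verb lemma and the frozen noun lemma) tolerates formal variation, such as an inserted adverb, that would defeat a strict string match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dep:
    """A parser dependency between two lemmas (hand-written here, hypothetical)."""
    name: str       # dependency label, e.g. "SUBJ", "CDIR", "MOD"
    head: str       # lemma of the governing word
    dependent: str  # lemma of the dependent word


# Hypothetical mini-lexicon: idiom -> (verb lemma, [(dependency label, frozen lemma)])
IDIOMS = {
    "perder a cabeça": ("perder", [("CDIR", "cabeça")]),
}


def find_fixed(deps):
    """Return FIXED tuples for every idiom whose constraints all hold in `deps`."""
    present = {(d.name, d.head, d.dependent) for d in deps}
    found = []
    for verb, constraints in IDIOMS.values():
        if all((label, verb, lemma) in present for label, lemma in constraints):
            # FIXED takes the verb and the frozen elements as its arguments.
            found.append(("FIXED", verb, *[lemma for _, lemma in constraints]))
    return found


# "O Pedro perdeu completamente a cabeça": the adverb breaks the literal string
# "perdeu a cabeça", but the CDIR dependency is unaffected.
idiomatic = [
    Dep("SUBJ", "perder", "Pedro"),
    Dep("MOD", "perder", "completamente"),
    Dep("CDIR", "perder", "cabeça"),
]

# "O Pedro perdeu a carteira": same verb, free (non-frozen) direct complement.
free = [
    Dep("SUBJ", "perder", "Pedro"),
    Dep("CDIR", "perder", "carteira"),
]

print(find_fixed(idiomatic))
print(find_fixed(free))
```

A lexical match of this kind says nothing about the literal versus idiomatic ambiguity mentioned in the abstract; it only marks candidate word combinations on top of an ordinary parse.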
Similar resources
PARSEME Survey on MWE Resources
This paper summarizes the preliminary results of an ongoing survey on multiword resources carried out within the IC1207 Cost Action PARSEME (PARSing and Multi-word Expressions). Despite the availability of language resource catalogs and the inventory of multiword datasets on the SIGLEX-MWE website, multiword resources are scattered and difficult to find. In many cases, language resources such a...
Parsing and MWE Detection: Fips at the PARSEME Shared Task
Identifying multiword expressions (MWEs) in a sentence in order to ensure their proper processing in subsequent applications, like machine translation, and performing the syntactic analysis of the sentence are interrelated processes. In our approach, priority is given to parsing alternatives involving collocations, and hence collocational information helps the parser through the maze of alterna...
COST 296 Action: Mitigation of Ionospheric Effects on Radio Systems (MIERS)
1. Welcome BZ, the local host, welcomed the participants, and explained the logistical arrangements. AB (COST 296 Chairperson) thanked BZ for hosting this meeting, everyone for coming and wished us all a good meeting. 2. Approval of the Agenda The Draft Agenda for the meeting was approved with small changes, see ANNEX I 3. Adoption of the Minutes of the fourth MC meeting The minutes of the fift...
A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
Multiword expressions are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions represent a subgroup of multiword expressions, namely that in which a verb is the syntactic head of the group considered in its canonical (or dictionary) form. All multiword expressions are a great challenge for natural language processing, but the ...
Report of the International Society for Zinc Biology 5th Meeting, in Collaboration with Zinc-Net (COST Action TD1304)—UCLan Campus, Pyla, Cyprus
From 18 to 22 June 2017, the fifth biennial meeting of the International Society for Zinc Biology was held in conjunction with the final dissemination meeting of the Network for the Biology of Zinc (Zinc-Net) at the University of Central Lancashire, Cyprus campus. The meeting attracted over 160 participants, had 17 scientific symposia, 4 plenary speakers and 2 poster discussion sessions. In thi...